Lab 4: Extending Logistic Regression
Cameron Matson
Zihao Mao
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import os
In this lab we aim to predict the position of an NBA player given some set of descriptors and statistics of their play. The original data set contains the player information and stats from every player that played in the NBA from 1950 to present.
Professional sports teams in general, and basketball teams in particular, have turned in recent years to data to gain an edge over their opponents. Part of the responsibility of the ownership and coaching staff is to assemble a team of players that gives them the best chance to win. Players on an NBA team have specific roles, largely based on the position that they play. It is therefore important that players play in the position that most helps their team win. We would like to create a classifier that would help NBA teams decide which position a player should play, which may not necessarily be the position they have actually played. This could especially be the case for players coming from college, where teams generally run systems quite different from those employed in the NBA.
What our classifier will do, then, is look at the stats and player details for each position and, given a new set of stats and a new player, make a probabilistic estimate of which position that player plays. If the result diverges from the player's listed position, this might be an indication that the player is not playing the correct position.
With respect to accuracy, it's important to keep in mind that our classifier is meant to aid basketball personnel decisions, not make them authoritatively. Its primary use is as a tool to compare a player to the performance and description of players in the past, and to report on the similarities the unknown player has to the different positions historically. So while we'd like to see near-perfect prediction on splits of our dataset, what will ultimately be most useful to us and to NBA teams is whether the reported probability of a player playing any given position is significantly higher than that of the other possible positions.
# first let's load the datasets in
data_path = '../data/basketball'
players = pd.read_csv(os.path.join(data_path, 'players.csv'))
players.head()
We probably don't need the "collage [sic]" column or their birth location. We probably don't need their birth year either, though it might be interesting to look at generational splits. Also, that unnamed column looks just like the index, so we can drop it too.
players.drop(['Unnamed: 0', 'collage', 'birth_city', 'birth_state', 'born'], axis=1, inplace=True)
players.info()
Good. They're all non-null and seem to be the correct datatypes.
Now let's load the players stats.
stats = pd.read_csv(os.path.join(data_path, 'seasons_stats.csv'))
stats.info()
There are a lot of fields here, and they're pretty inconsistently filled. Some of this arises from the fact that it's such a long timeline. For example, in 1950 there was no such thing as a 3-pointer, so it wouldn't make sense for those players to have 3PT% stats.
Inspecting the dataset a little further, we notice that there is no stat for points per game (PPG). The total number of points scored is listed, but that is hard to compare across seasons in which players played different numbers of games. To make the points column a valid comparison measure, we'll only consider seasons with the current full 82-game schedule. This doesn't reduce the dataset by much: the league moved to an 82-game season in 1967, and only the lockout-shortened 1998-99 season since then lacked a full schedule.
Actually, we might want to limit it to just seasons after 1980, when the 3-pointer was introduced. That should make the prediction task easier, although we lose even more of the dataset. But if we consider the business case to be deciding players' positions TODAY, it makes sense.
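An alternative to restricting seasons would have been per-game normalization; a minimal sketch on a toy frame (column names mirror the dataset's `PTS` and `G`, the numbers are made up):

```python
import pandas as pd

# toy season totals standing in for the real data (numbers are made up)
toy = pd.DataFrame({'PTS': [1638, 820], 'G': [82, 50]})

# dividing totals by games played makes seasons of different length comparable
toy['PPG'] = toy['PTS'] / toy['G']
print(toy)
```

Restricting to full 82-game seasons sidesteps this, at the cost of dropping data.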
stats = stats[stats.Year >= 1980]
stats = stats[stats.Year != 1998]
stats.info()
Now, to start, let's just focus on a few categories, plus of course our label: position. We could probably use any of the features as a label, and see if one could predict performance in one aspect of the game from info in another. But for now we'll stick with predicting position.
stats_to_keep = {'Player', 'Pos', 'G', 'MP', 'FG', 'FGA', 'FT', 'FTA',
'2P', '2PA', '3P', '3PA', 'ORB', 'DRB', 'TRB', 'AST', 'STL', 'BLK',
'TOV', 'PF', 'PTS'}
stats_to_drop = set(stats.columns)-stats_to_keep
stats.drop(stats_to_drop, axis=1, inplace=True)
stats.info()
Okay. Finally, let's add the player description data to the stats dataframe.
stats['height'] = np.nan
stats['weight'] = np.nan
iplayer = players.set_index(keys='Player')
istats = stats.reset_index(drop=True)
for i, row in istats.iterrows():
    name = row['Player']
    # look up the player's listed height and weight (assumes names are unique)
    h = iplayer.loc[name, 'height']
    w = iplayer.loc[name, 'weight']
    # height and weight show up in the last two columns
    istats.iloc[i, len(istats.columns) - 2] = h
    istats.iloc[i, len(istats.columns) - 1] = w
stats = istats
# and now we don't need the names anymore
stats.drop(['Player'], axis=1, inplace=True)
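The row-by-row lookup loop above is slow and assumes player names are unique; for reference, a vectorized alternative with `pandas.merge` would look like this (a sketch on toy frames, with `players_toy`/`stats_toy` standing in for the real tables):

```python
import pandas as pd

# toy frames mirroring the real players/stats columns (values are made up)
players_toy = pd.DataFrame({'Player': ['A. Guard', 'B. Center'],
                            'height': [188.0, 213.0],
                            'weight': [88.0, 113.0]})
stats_toy = pd.DataFrame({'Player': ['A. Guard', 'B. Center', 'A. Guard'],
                          'PTS': [1200, 900, 1100]})

# left-join height/weight onto every season row for that player
merged = stats_toy.merge(players_toy, on='Player', how='left')
print(merged[['Player', 'height', 'weight']])
```

The left join keeps every season row and repeats each player's description as needed.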
Finally, let's convert the position from a string to a number 1-5, and divide the dataset into data and labels (X and y). There are technically more than 5 positions listed (some players have multiple positions listed), but we'll just go off of the first-listed, primary position.
# first we need to separate out the label from the data
y = stats.Pos
set(y)
Let's reclassify this numerically and only count the 'primary' position, so that each player is given a position 1-5: {(1, PG), (2, SG), (3, SF), (4, PF), (5, C)}.
import numpy as np
def convert_pos(y):
    newy = np.zeros((len(y), 1))  # positions matching none of the cases stay 0
    for i, player in enumerate(y):
        if player[0] == 'C':
            newy[i] = 5
        elif player[0:2] == 'PF':
            newy[i] = 4
        elif player[0:2] == 'SF':
            newy[i] = 3
        elif player[0:2] == 'SG':
            newy[i] = 2
        elif player[0:2] == 'PG':
            newy[i] = 1
    return newy
y = convert_pos(y)
y
# let's update it in the dataframe just for fun
stats['Pos'] = y
from sklearn import preprocessing
# now we can drop the label, and we'll scale it to zero mean
X = stats.drop(['Pos'], axis=1)
X = preprocessing.scale(X, axis=0) # standardize each feature (column-wise), not each row
X
Before we go any further let's see what some of the distributions are
# first, how many of each class
import plotly
plotly.offline.init_notebook_mode() # run at the start of every notebook
graph1 = {'labels': ['PG', 'SG', 'SF', 'PF', 'C'],
'values': np.bincount(y[:,0].astype(np.int32)-1), # shift labels 1-5 to 0-4 for bincount
'type': 'pie'}
fig = dict()
fig['data'] = [graph1]
fig['layout'] = {'title': 'Total Class Distribution',
'autosize':False,
'width':500,
'height':400}
plotly.offline.iplot(fig)
Pretty even across the board. Not a requirement, but nice to know. It also makes sense: there are 5 unique positions in basketball, none of them doubled (at least in theory), so there should be approximately the same number of each.
# now let's look at the distribution for a few of the features
sns.kdeplot(data=stats[stats.Pos == 1].height)
sns.kdeplot(data=stats[stats.Pos == 2].height)
sns.kdeplot(data=stats[stats.Pos == 3].height)
sns.kdeplot(data=stats[stats.Pos == 4].height)
sns.kdeplot(data=stats[stats.Pos == 5].height)
plt.legend(['PG', 'SG', 'SF', 'PF', 'C'])
plt.show()
This matches the common intuition that PGs (the 1) are the shortest players and Cs (the 5) are the tallest. I bet height and weight are correlated too.
sns.jointplot(x="height", y="weight", data=stats);
Good intuition. What this means is that if the model weight on 'height' is high, it will likely be high for 'weight' (the feature) as well, and vice versa.
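The height–weight relationship can be quantified with a correlation coefficient; a toy check with `np.corrcoef` (synthetic numbers, not the real data):

```python
import numpy as np

rng = np.random.default_rng(0)
h_toy = rng.normal(200, 10, 500)             # synthetic heights
w_toy = 0.9 * h_toy + rng.normal(0, 5, 500)  # weights tied to height plus noise

# Pearson correlation of the two synthetic features
r = np.corrcoef(h_toy, w_toy)[0, 1]
print(round(r, 2))  # strongly positive, close to 1
```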
def plot_explained_variance(pca):
import plotly
from plotly.graph_objs import Scatter, Marker, Layout, XAxis, YAxis, Bar, Line
plotly.offline.init_notebook_mode() # run at the start of every notebook
explained_var = pca.explained_variance_ratio_
cum_var_exp = np.cumsum(explained_var)
plotly.offline.iplot({
"data": [Bar(y=explained_var, name='individual explained variance'),
Scatter(y=cum_var_exp, name='cumulative explained variance')
],
"layout": Layout(xaxis=XAxis(title='Principal components'), yaxis=YAxis(title='Explained variance ratio'))
})
from sklearn.decomposition import PCA
n_components = 19
print("Extracting the top %d principal components from %d features" % (
    n_components, stats.shape[1]))
pca = PCA(n_components=n_components)
%time stat_pca = pca.fit(X.copy()).transform(X.copy())
pca.components_.shape
plot_explained_variance(pca)
According to the diagram, we can reach a sufficient level of explained variance with 5 components. Keeping 98% of the variance while reducing from ~20 features to 5 is a trade we can live with, and it will greatly speed up our analysis.
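Reading the component count off the chart can also be done programmatically; a sketch on synthetic data showing how to pick the smallest number of components reaching a variance threshold:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# synthetic data: 3 informative directions, each duplicated with tiny noise
base = rng.normal(size=(300, 3))
X_demo = np.hstack([base, base + 0.01 * rng.normal(size=(300, 3))])

pca_demo = PCA().fit(X_demo)
cum = np.cumsum(pca_demo.explained_variance_ratio_)
k = int(np.searchsorted(cum, 0.98) + 1)  # smallest k explaining >= 98% of variance
print(k)  # 3
```

sklearn can also do this directly: passing a float, e.g. `PCA(n_components=0.98)`, keeps just enough components to reach that fraction of variance.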
n_components = 5
print("Extracting the top %d principal components from %d features" % (
    n_components, stats.shape[1]))
pca = PCA(n_components=n_components)
%time stats_pca = pca.fit(X.copy()).transform(X.copy())
pca.components_.shape
print(stats_pca)
stats_pca = pd.DataFrame(stats_pca, columns = ['first', 'second', 'third', 'fourth', 'fifth'])
# drop the label from stats so its columns line up with the PCA feature loadings
stats.drop(['Pos'], axis=1, inplace=True)
pca_absolute = np.absolute(pca.components_)
import plotly.graph_objs as go
import plotly.plotly as py
import plotly
from random import randint
plotly.offline.init_notebook_mode() # run at the start of every notebook
colors = []
for i in range(len(stats.columns)):  # one random color per feature
    colors.append('%06X' % randint(0, 0xFFFFFF))
# one bar trace per principal component, sharing the per-feature colors
names = ['First', 'Second', 'Third', 'Fourth', 'Fifth']
traces = [go.Bar(x=stats.columns,
                 y=pca_absolute[i],
                 name=names[i],
                 marker=dict(color=colors))
          for i in range(5)]
data = traces
layout = go.Layout(
barmode = 'group'
)
fig = go.Figure(data = data, layout = layout)
plotly.offline.iplot(fig, filename = 'grouped-bar')
As we can see from the graph, certain features are weighted heavily in most of the top 5 principal components (and significantly in at least one): namely minutes played, 3-point attempts, 2-point attempts, total rebounds, assists, and height.
This is important to note because it might help inform our decision given the output of the classifier if we look at these features for a given input.
From here on, we'll be using these 5 components instead of the full feature set in order to speed up calculations. We can be reasonably sure we are not losing much information because these 5 components explain 98% of the total variance.
X = stats_pca.values
X = preprocessing.scale(X, axis=0) # standardize each principal component (column-wise)
X.shape
Now we'll just start out by defining the Binary Logistic Regression classes as we did in class.
class BinaryLogisticRegressionBase:
# private:
def __init__(self, eta, iterations=20):
self.eta = eta
self.iters = iterations
# internally we will store the weights as self.w_ to keep with sklearn conventions
def __str__(self):
return 'Base Binary Logistic Regression Object, Not Trainable'
# convenience, private:
@staticmethod
def _sigmoid(theta):
return 1/(1+np.exp(-theta))
@staticmethod
def _add_bias(X):
return np.hstack((np.ones((X.shape[0],1)),X)) # add bias term
# public:
def predict_proba(self,X,add_bias=True):
# add bias term if requested
Xb = self._add_bias(X) if add_bias else X
return self._sigmoid(Xb @ self.w_) # return the probability y=1
def predict(self,X):
return (self.predict_proba(X)>0.5) #return the actual prediction
blr = BinaryLogisticRegressionBase(0.1)
print(blr)
# inherit from base class
class BinaryLogisticRegression(BinaryLogisticRegressionBase):
#private:
def __str__(self):
if(hasattr(self,'w_')):
return 'Binary Logistic Regression Object with coefficients:\n'+ str(self.w_) # if we have trained the object
else:
return 'Untrained Binary Logistic Regression Object'
def _get_gradient(self,X,y):
# programming \sum_i (yi-g(xi))xi
gradient = np.zeros(self.w_.shape) # set gradient to zero
for (xi,yi) in zip(X,y):
# the actual update inside of sum
gradi = (yi - self.predict_proba(xi,add_bias=False))*xi
# reshape to be column vector and add to gradient
gradient += gradi.reshape(self.w_.shape)
return gradient/float(len(y))
# public:
def fit(self, X, y):
Xb = self._add_bias(X) # add bias term
num_samples, num_features = Xb.shape
self.w_ = np.zeros((num_features,1)) # init weight vector to zeros
# for as many as the max iterations
for _ in range(self.iters):
gradient = self._get_gradient(Xb,y)
self.w_ += gradient*self.eta # multiply by learning rate
blr = BinaryLogisticRegression(0.1)
print(blr)
# now lets do some vectorized coding
import numpy as np
from scipy.special import expit
class VectorBinaryLogisticRegression(BinaryLogisticRegression):
# inherit from our previous class to get same functionality
@staticmethod
def _sigmoid(theta):
# increase stability, redefine sigmoid operation
return expit(theta) #1/(1+np.exp(-theta))
# but overwrite the gradient calculation
def _get_gradient(self,X,y):
yhat = self.predict_proba(X, add_bias=False).ravel()[:, np.newaxis]
ydiff = y-yhat # get y difference
gradient = np.mean(X * ydiff, axis=0) # make ydiff a column vector and multiply through
return gradient.reshape(self.w_.shape)
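The switch to `scipy.special.expit` is about numerical stability: the naive `1/(1+np.exp(-theta))` overflows (with a runtime warning) for large negative inputs, while `expit` saturates cleanly. A quick check:

```python
import numpy as np
from scipy.special import expit

theta = np.array([-1000.0, 0.0, 1000.0])

with np.errstate(over='ignore'):
    naive = 1 / (1 + np.exp(-theta))  # np.exp(1000) overflows to inf here

stable = expit(theta)                 # saturates to [0, 0.5, 1] without warnings
print(naive.tolist(), stable.tolist())
```

Both give the same limits here, but the naive form emits overflow warnings and can misbehave in further arithmetic.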
Compare one vs. all, starting with the 'Center' position, which is #5.
An 80/20 split is reasonable for our dataset, since we have a large number of instances and each class is essentially equally represented. (Note that ShuffleSplit does not stratify; with classes this balanced, a plain shuffle should still give each split roughly the same class mix.) We're also not dealing with any time series, or at least we're not defining them to be as such.
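For reference, sklearn's StratifiedShuffleSplit makes the class-balance guarantee explicit; a minimal sketch on toy imbalanced labels:

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# toy labels with a 4:1 class imbalance
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.zeros((100, 2))

sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(sss.split(X_toy, y_toy))

# each split preserves the 80/20 class ratio
print(np.bincount(y_toy[test_idx]))  # [16  4]
```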
from sklearn.model_selection import ShuffleSplit
from sklearn.metrics import accuracy_score
yb = (y>4).astype(int) # binary label: 1 for Centers (position 5), 0 otherwise
params = dict(eta=0.2,
iterations=1000)
blr = VectorBinaryLogisticRegression(**params)
num_cv_iterations = 10
num_instances = len(yb)
cv_object = ShuffleSplit(n_splits=num_cv_iterations,
test_size = 0.2)
score = np.zeros((10,)) # one accuracy per CV split
for iter_num, (train_indices, test_indices) in enumerate(cv_object.split(X,yb)):
blr.fit(X[train_indices],yb[train_indices]) # train object
y_hat = blr.predict(X[test_indices]) # get test set predictions
# print the accuracy and confusion matrix
print("====Iteration",iter_num," ====")
s = accuracy_score(yb[test_indices],y_hat)
print("accuracy", s)
score[iter_num] = s
print("mean accuracy:", np.mean(score), "(", np.std(score),")")
means = np.zeros((5,))
stds = np.zeros((5,))
for i in range(5):
yb = (y==i+1).astype(int)
params = dict(eta=0.2,
iterations=1000)
blr = VectorBinaryLogisticRegression(**params)
num_cv_iterations = 10
num_instances = len(yb)
cv_object = ShuffleSplit(
n_splits=num_cv_iterations,
test_size = 0.2)
for iter_num, (train_indices, test_indices) in enumerate(cv_object.split(X,yb)):
blr.fit(X[train_indices],yb[train_indices]) # train object
y_hat = blr.predict(X[test_indices]) # get test set predictions
s = accuracy_score(yb[test_indices],y_hat)
score[iter_num] = s
means[i] = np.mean(score)
stds[i] = np.std(score)
print("mean accuracy: ", means[i], "+-", stds[i])
data = [go.Bar(
x=['PG', 'SG', 'SF','PF','C'],
y=means,
error_y=dict(
type='data',
array=stds,
),
)]
layout = go.Layout(
title = "Accuracies for each class"
)
fig = go.Figure(data = data, layout = layout)
plotly.offline.iplot(fig, filename='basic-bar')
Really small error bars; you love to see that. It means our train/test splits produce consistent results and should be pretty representative of real data.
class LogisticRegression:
def __init__(self, eta, iterations=20):
self.eta = eta
self.iters = iterations
# internally we will store the weights as self.w_ to keep with sklearn conventions
def __str__(self):
if(hasattr(self,'w_')):
return 'MultiClass Logistic Regression Object with coefficients:\n'+ str(self.w_) # if we have trained the object
else:
return 'Untrained MultiClass Logistic Regression Object'
def fit(self,X,y):
num_samples, num_features = X.shape
self.unique_ = np.unique(y) # get each unique class value
num_unique_classes = len(self.unique_)
self.classifiers_ = [] # will fill this array with binary classifiers
for i,yval in enumerate(self.unique_): # for each unique value
y_binary = y==yval # create a binary problem
# train the binary classifier for this class
blr = VectorBinaryLogisticRegression(self.eta,self.iters)
blr.fit(X,y_binary)
# add the trained classifier to the list
self.classifiers_.append(blr)
# save all the weights into one matrix, separate column for each class
self.w_ = np.hstack([x.w_ for x in self.classifiers_]).T
def predict_proba(self,X):
probs = []
for blr in self.classifiers_:
probs.append(blr.predict_proba(X)) # get probability for each classifier
return np.hstack(probs) # make into single matrix
def predict(self,X):
return np.argmax(self.predict_proba(X),axis=1) # take argmax along row
lr = LogisticRegression(0.1,1500)
print(lr)
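The multiclass prediction above is just a column-wise argmax over the per-class probabilities; a toy illustration of that final step:

```python
import numpy as np

# per-class probabilities from three one-vs-rest classifiers (made-up numbers)
probs_demo = np.array([[0.10, 0.70, 0.20],
                       [0.55, 0.30, 0.15]])

# index of the most probable class for each row
pred_demo = np.argmax(probs_demo, axis=1)
print(pred_demo.tolist())  # [1, 0]
```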
%%time
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
lr = LogisticRegression(0.1,1000)
lr.fit(X_train,y_train)
print(lr)
yhat = lr.predict(X_test)
print('Accuracy of: ',accuracy_score(y_test,yhat))
Now we'll use various optimization techniques and varying parameters to see if we can get better results.
First we'll set up all the classes that we need.
# from last time, our logistic regression algorithm is given by (including everything we previously had):
class BinaryLogisticRegression:
def __init__(self, eta, iterations=20, C=0.001, regularization='L2'):
self.eta = eta
self.iterations = iterations
self.C = C
self.regularization = regularization
# internally we will store the weights as self.w_ to keep with sklearn conventions
def __str__(self):
if(hasattr(self,'w_')):
return 'Binary Logistic Regression Object with coefficients:\n'+ str(self.w_) # if we have trained the object
else:
return 'Untrained Binary Logistic Regression Object'
# convenience, private:
@staticmethod
def _add_bias(X):
return np.hstack((np.ones((X.shape[0],1)),X)) # add bias term
@staticmethod
def _sigmoid(theta):
# increase stability, redefine sigmoid operation
return expit(theta) #1/(1+np.exp(-theta))
# vectorized gradient calculation with regularization
def _get_gradient(self,X,y):
ydiff = y-self.predict_proba(X, add_bias=False) # get y difference
gradient = np.mean(X * ydiff, axis=0) # make ydiff a column vector and multiply through
gradient = gradient.reshape(self.w_.shape)
# gradient will be different depending on which regularization technique we use
# ascent-direction penalty gradients: both should shrink the weights
L1 = -np.sign(self.w_[1:]) * self.C
L2 = -2 * self.w_[1:] * self.C
if (self.regularization=='L1'):
gradient[1:] += L1
elif (self.regularization == 'L2'):
gradient[1:] += L2
elif (self.regularization == 'L1/L2'):
gradient[1:] += (L1 + L2)
return gradient
# public:
def predict_proba(self,X,add_bias=True):
# add bias term if requested
Xb = self._add_bias(X) if add_bias else X
return self._sigmoid(Xb @ self.w_) # return the probability y=1
def predict(self,X):
return (self.predict_proba(X)>0.5) #return the actual prediction
def fit(self, X, y):
Xb = self._add_bias(X) # add bias term
num_samples, num_features = Xb.shape
self.w_ = np.zeros((num_features,1)) # init weight vector to zeros
# for as many as the max iterations
for _ in range(self.iterations):
gradient = self._get_gradient(Xb,y)
self.w_ += gradient*self.eta # multiply by learning rate
Test it out
blr = BinaryLogisticRegression(eta=0.1,iterations=500,C=0.1)
blr.fit(X,yb)
print(blr)
yhat = blr.predict(X)
print('Accuracy of: ',accuracy_score(yb,yhat))
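The regularization term in `_get_gradient` can be sanity-checked numerically: for gradient ascent on the penalized log-likelihood, the L2 penalty `-C*w**2` contributes `-2*C*w` to the gradient. A finite-difference check on a toy weight vector (`C_demo` and `w_demo` are made-up values):

```python
import numpy as np

C_demo = 0.1
w_demo = np.array([0.5, -1.2, 2.0])

# ascent-direction gradient of the L2 penalty term -C*w**2
analytic = -2 * C_demo * w_demo

# central finite differences of the same penalty term
eps = 1e-6
numeric = np.array([(-C_demo * (wi + eps) ** 2 + C_demo * (wi - eps) ** 2) / (2 * eps)
                    for wi in w_demo])

print(np.allclose(analytic, numeric))  # True
```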
Stochastic Class
class StochasticLogisticRegression(BinaryLogisticRegression):
# stochastic gradient calculation
def _get_gradient(self,X,y):
idx = int(np.random.rand()*len(y)) # grab random instance
ydiff = y[idx]-self.predict_proba(X[idx],add_bias=False) # get y difference (now scalar)
gradient = X[idx] * ydiff[:,np.newaxis] # make ydiff a column vector and multiply through
gradient = gradient.reshape(self.w_.shape)
# ascent-direction penalty gradients, as in the batch version
L1 = -np.sign(self.w_[1:]) * self.C
L2 = -2 * self.w_[1:] * self.C
if (self.regularization=='L1'):
gradient[1:] += L1
elif (self.regularization == 'L2'):
gradient[1:] += L2
elif (self.regularization == 'L1/L2'):
gradient[1:] += (L1 + L2)
return gradient
Implementation of BFGS
from scipy.optimize import fmin_bfgs
class BFGSBinaryLogisticRegression(BinaryLogisticRegression):
@staticmethod
def objective_function(w,X,y,C):
g = expit(X @ w)
return -np.sum(np.log(g[y==1]))-np.sum(np.log(1-g[y==0])) + C*sum(w**2) #-np.sum(y*np.log(g)+(1-y)*np.log(1-g))
def objective_gradient(self,w,X,y,C):
    # note: the objective above always includes an L2 penalty, so the L1
    # and L1/L2 branches below only approximate its true gradient
    g = expit(X @ w)
    ydiff = y-g # get y difference
    gradient = np.mean(X * ydiff[:,np.newaxis], axis=0)
    gradient = gradient.reshape(w.shape)
    L1 = -np.sign(w[1:]) * C # ascent-direction gradient of the L1 penalty
    L2 = -2 * w[1:] * C      # ascent-direction gradient of the L2 penalty
    if (self.regularization=='L1'):
        gradient[1:] += L1
    elif (self.regularization == 'L2'):
        gradient[1:] += L2
    elif (self.regularization == 'L1/L2'):
        gradient[1:] += (L1 + L2)
    return -gradient # fmin_bfgs minimizes, so return the descent gradient
# just overwrite fit function
def fit(self, X, y):
# cap the iteration count; BFGS converges in very few steps here
self.iters = 3
Xb = self._add_bias(X) # add bias term
num_samples, num_features = Xb.shape
self.w_ = fmin_bfgs(self.objective_function, # what to optimize
np.zeros((num_features,1)), # starting point
fprime=self.objective_gradient, # gradient function
args=(Xb,y,self.C), # extra args for gradient and objective function
gtol=1e-03, # stopping criteria for gradient, |v_k|
maxiter=self.iters, # stopping criteria iterations
disp=False)
self.w_ = self.w_.reshape((num_features,1))
Note that all of the above are BINARY classifiers, i.e. they make a one-vs-all comparison. To get multiclass predictions we need to do that for all of the classes. The class below does just that. It's the same as the multiclass implementation above, modified to offer a choice of optimization and regularization method, plus some requisite helper functions for use with sklearn's GridSearchCV.
Using GridSearchCV will allow us to tune the free parameters we've created (iterations, C, optimization technique, and regularization method) to find the best solution (as defined by the score() function, which we have said is just the accuracy).
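As a self-contained illustration of the GridSearchCV workflow (using sklearn's own LogisticRegression on synthetic data rather than the classifier defined below):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# synthetic classification problem standing in for the real data
X_syn, y_syn = make_classification(n_samples=200, n_features=5, random_state=0)

# exhaustive search over a small C grid with 3-fold cross-validation
grid = GridSearchCV(LogisticRegression(max_iter=1000),
                    {'C': [0.01, 0.1, 1.0]}, cv=3)
grid.fit(X_syn, y_syn)

print(grid.best_params_, round(grid.best_score_, 2))
```

Any estimator exposing `fit`, `score`, `get_params`, and `set_params` can be searched this way, which is why those methods appear in the class below.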
import warnings
class LRClassifier:
def __init__(self, eta=0.1, iterations=20, C=0.001, optimization='full', regularization='L2'):
self.eta = eta
self.iterations = iterations
self.C = C
self.opt = optimization
self.regularization = regularization
# internally we will store the weights as self.w_ to keep with sklearn conventions
def __str__(self):
if(hasattr(self,'w_')):
return 'MultiClass Logistic Regression Object with coefficients:\n'+ str(self.w_) # if we have trained the object
else:
return 'Untrained MultiClass Logistic Regression Object'
def fit(self,X,y):
num_samples, num_features = X.shape
self.unique_ = np.unique(y) # get each unique class value
num_unique_classes = len(self.unique_)
self.classifiers_ = [] # will fill this array with binary classifiers
for i,yval in enumerate(self.unique_): # for each unique value
y_binary = y==yval # create a binary problem
# train the binary classifier for this class
if (self.opt == 'stochastic'):
blr = StochasticLogisticRegression(self.eta, self.iterations, self.C, self.regularization)
elif (self.opt == 'bfgs'):
blr = BFGSBinaryLogisticRegression(self.eta, 2, self.C, self.regularization) # eta is unused by BFGS
else:
blr = BinaryLogisticRegression(self.eta, self.iterations, self.C, self.regularization)
blr.fit(X,y_binary)
# add the trained classifier to the list
self.classifiers_.append(blr)
# save all the weights into one matrix, separate column for each class
self.w_ = np.hstack([x.w_ for x in self.classifiers_]).T
def predict_proba(self,X):
probs = []
for blr in self.classifiers_:
probs.append(blr.predict_proba(X)) # get probability for each classifier
return np.hstack(probs) # make into single matrix
def predict(self,X):
return np.argmax(self.predict_proba(X),axis=1) # take argmax along row
def score(self, X, y, *args, **kwargs):
    yhat = self.predict(X)
    return accuracy_score(y, yhat+1) # predict returns 0-4, labels are 1-5
# lifted straight out of sklearn source code with some modifications
@classmethod
def _get_param_names(cls):
# this is just specific to this classifier
return sorted(['eta', 'iterations', 'C', 'optimization', 'regularization'])
def get_params(self, deep=True):
"""Get parameters for this estimator.
Parameters
----------
deep : boolean, optional
If True, will return the parameters for this estimator and
contained subobjects that are estimators.
Returns
-------
params : mapping of string to any
Parameter names mapped to their values.
"""
out = dict()
for key in self._get_param_names():
# We need deprecation warnings to always be on in order to
# catch deprecated param values.
# This is set in utils/__init__.py but it gets overwritten
# when running under python3 somehow.
warnings.simplefilter("always", DeprecationWarning)
try:
with warnings.catch_warnings(record=True) as w:
value = getattr(self, key, None)
if len(w) and w[0].category == DeprecationWarning:
# if the parameter is deprecated, don't show it
continue
finally:
warnings.filters.pop(0)
# XXX: should we rather test if instance of estimator?
if deep and hasattr(value, 'get_params'):
deep_items = value.get_params().items()
out.update((key + '__' + k, val) for k, val in deep_items)
out[key] = value
return out
def set_params(self, **params):
"""Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects
(such as pipelines). The latter have parameters of the form
``<component>__<parameter>`` so that it's possible to update each
component of a nested object.
Returns
-------
self
"""
if not params:
# Simple optimization to gain speed (inspect is slow)
return self
valid_params = self.get_params(deep=True)
# changed from six.iteritems() bc no need for py2 vs py3 compatibility
for key, value in params.items():
split = key.split('__', 1)
if len(split) > 1:
# nested objects case
name, sub_name = split
if name not in valid_params:
raise ValueError('Invalid parameter %s for estimator %s. '
'Check the list of available parameters '
'with `estimator.get_params().keys()`.' %
(name, self))
sub_object = valid_params[name]
sub_object.set_params(**{sub_name: value})
else:
# simple objects case
if key not in valid_params:
raise ValueError('Invalid parameter %s for estimator %s. '
'Check the list of available parameters '
'with `estimator.get_params().keys()`.' %
(key, self.__class__.__name__))
setattr(self, key, value)
return self
Let's try it out! This will take some time
%%time
from sklearn.model_selection import GridSearchCV
parameters = {'iterations': [20, 50, 100],
'C':np.logspace(base=10, start=-3, stop=1, num=5),
'optimization': ['full', 'stochastic', 'bfgs'],
'regularization': ['L1', 'L2', 'L1/L2']}
lrc = LRClassifier(eta=0.1)
clf = GridSearchCV(lrc, parameters)
clf.fit(X, y)
cvresults = pd.DataFrame(clf.cv_results_)
cvresults
print("The best cross-validated accuracy was:", clf.best_score_)
print("found using the following parameters:")
print(clf.best_estimator_.get_params())
Let's take a closer look at how performance varied over various parameters. Holding the optimization and iterations parameters constant, how did the regularization affect the results?
batch = cvresults[cvresults.param_optimization == 'full']
batch100 = batch[batch.param_iterations == 100]
sns.pointplot(batch100.param_C.values, batch100.mean_test_score.values, hue=batch100.param_regularization)
plt.title('Accuracy vs. Reg Constant C over different Reg Techniques')
plt.show()
Looks like the method of regularization is relatively insignificant for small values of C, but as C increases the L1 method prevails. In general, however, the smaller the value of C, the better the performance.
Which, if any, of the optimization techniques performed the fastest?
L1L2 = cvresults[cvresults.param_regularization == 'L1/L2']
L1L2_C1 = L1L2[L1L2.param_C == .001]
sns.pointplot(L1L2_C1.param_iterations.values, L1L2_C1.mean_fit_time.values, hue=L1L2_C1.param_optimization)
plt.title('Fit time vs. # of Iterations by Optimization Technique')
plt.show()
Looks like the optimization techniques aren't really optimizing much. They all have about the same fit_time.
SKLearn has been helping us a lot here. Let's see how we compare against their implementation
from sklearn.linear_model import LogisticRegressionCV as SKLogisticRegression
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
lr_sk = SKLogisticRegression() # all params default
%time lr_sk.fit(X_train,y_train.ravel())
print(np.hstack((lr_sk.intercept_[:,np.newaxis],lr_sk.coef_)))
yhat = lr_sk.predict(X_test)
best_score = accuracy_score(y_test.ravel(),yhat)
print("The best test accuracy was:", best_score)
print("found using the following parameters:")
print(lr_sk.get_params())
print('values of C (for each class, bc sklearn is dope like that)', lr_sk.C_)
SKLearn is reporting a different set of parameters to optimize the accuracy, albeit for about the same accuracy.
mimic = cvresults[(cvresults.param_optimization == 'bfgs') & (cvresults.param_iterations == 100) & (cvresults.param_regularization == 'L2')]
mimic
These are roughly the same value for small values of C (mean_test_score ~= 0.44) but it's hard to compare because sklearn has a different C for each class, and we are only using one.
print("The best cross-validated accuracy was:", clf.best_score_)
print("found using the following parameters:")
print(clf.best_estimator_.get_params())
Even though it's hard to determine which method is better, we would recommend the one from sklearn. One of the main reasons is the accuracy of the predictions. The first time we tried to implement logistic regression, we didn't perform PCA to reduce the features to 5; we had more than 20 features. The accuracy we could get from our method was about 45%, while sklearn's reached about 56%.
This might be because sklearn uses a different value of C for each class, as mentioned before, or some other implicit processing inside their method, which could also be why theirs took longer to execute. But we can assume their method performs better than ours with a larger number of features.
x1 = stats.drop(['Pos'], axis=1, errors='ignore') # all 21 features, label excluded
x1 = preprocessing.scale(x1, axis=0) # standardize each feature
x1.shape
X_train, X_test, y_train, y_test = train_test_split(x1, y, test_size=0.2)
%%time
#execute with our method:
# we use the parameters that performs the best from above.
lrc = LRClassifier(eta=0.1, C=0.1, iterations=100, optimization='full', regularization='L1')
lrc.fit(X_train, y_train)
yhat = lrc.predict(X_test)
score = accuracy_score(y_test,yhat+1)
print("The accuracy:", score)
%%time
lr_sk = SKLogisticRegression() # all params default
lr_sk.fit(X_train,y_train.ravel())
yhat = lr_sk.predict(X_test)
score = accuracy_score(y_test.ravel(),yhat)
print("The accuracy:", score)